Speech Recognition Apparatus

ABSTRACT

A speech recognition system (10) for improving the robustness of speech recognition for a speech input in which deteriorated features cannot be completely identified. The system comprises at least two sound detecting means (16a, 16b) for detecting a sound signal, a sound source localization unit (21) for determining the direction of a sound source based on the sound signal, a sound source separation unit (23) for separating a sound by sound source from the sound signal based on the sound source direction, a mask generation unit (25) for generating a mask value according to the reliability of the separation result, a feature extraction unit (27) for extracting features of the sound signal, and a speech recognition unit (29) for applying the mask to the features to recognize speech from the sound signal.

TECHNICAL FIELD

The present invention relates to a speech recognition apparatus, and in particular to a speech recognition apparatus that is robust to speech that tends to deteriorate due to noise, input device specifications and so on.

BACKGROUND OF THE INVENTION

In general, a speech recognition apparatus in a real environment receives speech that deteriorates as it is mixed with noise and sound reverberations. The speech may also deteriorate depending on the specifications of an input device. In order to cope with this problem, some approaches have been proposed for improving the robustness of speech recognition by using such techniques as spectral subtraction, blind source separation and so on. One such approach, proposed by M. Cooke et al. of Sheffield University, is the missing feature theory (“Robust automatic speech recognition with missing and unreliable acoustic data”, SPEECH COMMUNICATION 34, pp. 267-285, 2001, by Martin Cooke et al.). This approach aims at improving the robustness of speech recognition by identifying and masking missing features (that is, deteriorated features) contained in the features of an input speech. This approach is advantageous in that it requires less knowledge about noise in comparison with the other approaches.

In the missing feature theory, deteriorated features are identified based on the difference from the features of non-deteriorated speech, on the local SN ratio of the spectrum, or on an ASA (Auditory Scene Analysis). The ASA is a method of grouping components of the features by utilizing certain clues that are commonly included in sounds radiated from the same sound source. Such clues are, for example, the harmonic structure of the spectrum, synchronization of onsets, the position of the source, and the like. Speech recognition based on this theory includes several methods, such as a method of recognizing speech by estimating the original features for a masked portion and a method of recognizing speech by generating a sound model corresponding to the masked features.

SUMMARY OF THE INVENTION

In the missing feature theory, it is often difficult to identify the deteriorated features when improvement of the robustness of speech recognition is intended. The present invention proposes a speech recognition apparatus for improving the robustness of speech recognition for a speech input in which the deteriorated features cannot be completely identified.

The present invention provides a speech recognition apparatus for recognizing speech from sound signals that are collected from the outside. The apparatus has at least two sound detecting means for detecting the sound signals, a sound source localization unit for determining the direction of a sound source based on the sound signals, a sound source separation unit for separating the speech from the sound signals by sound source based on the direction of the sound sources, a mask generation unit for generating a value of a mask according to the reliability of the result of separation, a feature extraction unit for extracting features of the sound signals, and a speech recognition unit for recognizing the speech from the sound signals by applying the mask to the features.

According to the invention, the robustness of speech recognition can be improved because the value of the mask is generated according to the reliability of the result of separating the speech from the sound signal by sound source.

According to one aspect of the present invention, the mask generation unit generates the value of the mask according to the degree of correspondence between the results of separating the sound signals obtained using a plurality of sound source separation techniques that are different from the technique used in the sound source separation unit, and the result of the separation by the sound source separation unit.

According to another aspect of the present invention, the mask generation unit generates the value of the mask according to a pass-band, defined by the sound source direction, that is used for determining whether sounds are from the same sound source.

According to a further aspect of the present invention, when there are multiple sound sources, the mask generation unit generates the value of the mask by increasing the reliability of the separation result for a signal that is close to only one of the multiple sound sources.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a general view of a speech recognition system including a speech recognition apparatus in accordance with one embodiment of the present invention.

FIG. 2 is a block diagram of a speech recognition apparatus in accordance with one embodiment of the present invention.

FIG. 3 shows microphones and an epipolar geometry.

FIG. 4 is a graph showing the relation among an inter-microphone phase difference Δφ derived from an epipolar geometry, a frequency f and a sound source direction θ_(s).

FIG. 5 is a graph showing the relation among an inter-microphone phase difference Δφ derived from a transfer function, a frequency f and a sound source direction θ_(s).

FIG. 6 is a graph showing the relation among an inter-microphone sound intensity difference Δρ derived from a transfer function, a frequency f and a sound source direction θ.

FIG. 7 is a graph showing a positional relation between microphones and a sound source.

FIG. 8 is a graph showing the change in time of a sound source direction θ_(s).

FIG. 9 is a graph showing a pass-band function δ(θ).

FIG. 10 is a graph showing a sound source direction θ_(s) and a pass-band.

FIG. 11 is a graph showing how a sub-band is selected by using a phase difference Δφ in a sound source separation unit.

FIG. 12 is a graph showing how a sub-band is selected by using a sound intensity difference Δρ in a sound source separation unit.

FIG. 13 is a graph showing a function of a mask using a pass-band function.

REFERENCE CODES

- 10 Speech recognition apparatus
- 14 Sound source
- 16 Microphones
- 21 Sound source localization unit
- 23 Sound source separation unit
- 25 Mask generation unit
- 27 Feature extraction unit
- 29 Speech recognition unit

DESCRIPTION OF THE PREFERRED EMBODIMENTS

1. Outline

Embodiments of the present invention will be described below with reference to the accompanying drawings. FIG. 1 is a general view of a speech recognition system including a speech recognition apparatus 10 in accordance with one embodiment of the present invention.

In this system, as shown in FIG. 1, a body 12 having the speech recognition apparatus 10 is provided to recognize speech coming from a sound source 14 located in its surroundings. The sound source 14 is, for example, a human being or a robot, which produces speech for communication. The body 12 is, for example, a mobile robot or an electrical appliance, which uses speech recognition as an interface.

On both sides of the body 12, there is disposed a pair of microphones 16a, 16b for collecting speech from the sound source. It should be noted that the position of the microphones 16a, 16b is not limited to both sides of the body 12; they may be disposed at any other position relative to the body 12. Besides, the number of microphones is not limited to two; more than two microphones may be used.

In this system, the speech coming from the sound source 14 is collected by the microphones 16. The collected speech is processed by the speech recognition apparatus 10 located in the body 12. The speech recognition apparatus determines the direction of the sound source 14 in order to recognize the content of the speech. The body 12 may, for example, perform a task indicated by the content of the speech or may reply with an embedded speaking mechanism.

Now, details of the speech recognition apparatus 10 will be described. FIG. 2 is a block diagram of the speech recognition apparatus 10 in accordance with one embodiment of the present invention.

A plurality of microphones 16a, 16b collect speech coming from a single sound source or multiple sound sources 14 and deliver the sound signals containing the speech to the speech recognition apparatus 10.

A sound source localization unit 21 determines the direction θ_(s) of the sound source 14 based on the sound signals received with the microphones 16a, 16b. When the sound source 14 and/or the apparatus 10 itself moves, the localization of the sound source 14 is traced over time. In this embodiment, the localization of the sound source is performed by using a method based on epipolar geometry, scattering theory or a transfer function. A sound source separation unit 23 uses the direction information θ_(s) of the sound source 14 obtained in the sound source localization unit 21 to separate a sound source signal from the input signal. In this embodiment, the sound source separation is performed by combining an inter-microphone phase difference Δφ or an inter-microphone sound intensity difference Δρ (obtained using the above-described epipolar geometry, scattering theory or transfer function) with a pass-band function that imitates human auditory characteristics.

A mask generation unit 25 generates a value of a mask depending on whether the result of the separation by the sound source separation unit 23 is reliable or not. The spectrum of the input signal and/or the result of the sound source separation is used for evaluating the reliability of the separation result. The mask takes a value from 0 to 1; the closer the value is to 1, the higher the reliability. Each of the mask values generated in the mask generation unit is applied to the features of the input signal used in the speech recognition.

A feature extraction unit 27 extracts the features from the spectrum of the input signal.

A speech recognition unit 29 determines the output probability of the features from a sound model to recognize the speech. At this time, the mask generated in the mask generation unit 25 is applied in order to adjust the output probability. In this embodiment, the speech recognition is performed using a Hidden Markov Model (HMM).

The processes performed in each unit of the speech recognition apparatus 10 will be described below.

2. Sound Source Localization Unit

The sound source localization unit 21 determines the direction of the sound source 14 based on the sound signals received by the microphones 16a, 16b. In addition, when the sound source 14 and/or the apparatus 10 itself moves, the identified position of the sound source 14 is traced over time. In this embodiment, the localization of the sound source is performed by using a method selected from a plurality of methods, including a scheme using the epipolar geometry of the source 14 and the microphones 16 (refer to section 2.1), the scattering theory (refer to section 2.2) and a transfer function (refer to section 2.3). It should be noted that the source localization may be performed using any other known method, such as a beam forming method or the like.

2.1. Source Localization Using the Epipolar Geometry of the Sound Source and Microphones

This method uses the epipolar geometry of the microphones 16 and the sound source 14, as shown in FIG. 3, in order to calculate the source direction θ_(s). As shown in FIG. 3, the distance between the microphones 16a and 16b is represented by 2b. The middle point between both microphones is taken as the origin, and the direction perpendicular to the line connecting the microphones at the origin is assumed to be the front.

Details of the epipolar geometry can be found in the article “Position localization/separation/recognition of multiple sound sources by active audition” by Nakadai et al., AI Challenge Study Team, pp. 1043-1049, Association of Artificial Intelligence, 2002.

The sound source localization using the epipolar geometry is performed according to the following procedure:

1) The FFT or the like is used to perform a frequency analysis on the sound signals received from the microphones 16a, 16b to obtain spectra S1(f), S2(f).

2) The obtained spectra are divided into multiple frequency sub-bands, and a phase difference Δφ(f_(i)) of each sub-band f_(i) is obtained in accordance with Equation (1).

$$\Delta\phi(f_i) = \arctan\left(\frac{\mathrm{Im}[S1(f_i)]}{\mathrm{Re}[S1(f_i)]}\right) - \arctan\left(\frac{\mathrm{Im}[S2(f_i)]}{\mathrm{Re}[S2(f_i)]}\right) \qquad (1)$$

where Δφ(f_(i)) indicates the inter-microphone phase difference in the sub-band f_(i); Im[S1(f_(i))] and Re[S1(f_(i))] indicate the imaginary and real parts of the spectrum S1(f_(i)) of microphone 1 in the sub-band f_(i); and Im[S2(f_(i))] and Re[S2(f_(i))] indicate the imaginary and real parts of the spectrum S2(f_(i)) of microphone 2 in the sub-band f_(i).

3) The epipolar geometry (FIG. 3) is used to derive Equation (2).

$\begin{matrix}{{{\Delta\phi}\left( {\theta,f_{i}} \right)} = {\frac{2\pi \; f_{i}}{v} \times {b\left( {\theta + {\sin \; \theta}} \right)}}} & (2)\end{matrix}$

where v indicates the sound speed, b indicates the distance between the origin and each microphone, and θ indicates the angle of the sound source direction.

Values are assigned to θ in Equation (2), for example at every 5 degrees in a range from −90 degrees to +90 degrees, to obtain the relation between the frequency f_(i) and the phase difference Δφ shown in FIG. 4. Using this relation, the angle θ for which Δφ(θ, f_(i)) is closest to Δφ(f_(i)) is determined. This angle θ is the sound source direction θ_(i) of the sub-band f_(i).

4) From the sound source direction θ_(i) and the frequency of each sub-band, the sub-bands whose source directions are close to each other and which are in a harmonic relation to each other are selected and grouped. The sound source direction of such a group is taken as θ_(s). When a plurality of groups are selected, multiple sound sources may exist; in this case, the sound source direction of each group may be determined. When the number of sound sources is known in advance, it is desirable that the number of groups selected correspond to the number of sound sources.
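
As a rough illustration of steps 1) through 3), the following Python sketch computes Δφ(f_(i)) from two microphone signals and matches each sub-band against Equation (2) on a 5-degree grid. The sampling rate fs, baseline b and sound speed v below are assumed example values, not parameters fixed by this embodiment.

```python
import numpy as np

def epipolar_directions(x1, x2, fs, b=0.1, v=340.0, n_fft=512):
    """Estimate a direction theta_i for every frequency sub-band f_i."""
    s1 = np.fft.rfft(x1, n_fft)               # spectrum S1(f), step 1)
    s2 = np.fft.rfft(x2, n_fft)               # spectrum S2(f)
    freqs = np.fft.rfftfreq(n_fft, 1.0 / fs)  # sub-band centre frequencies f_i
    dphi = np.angle(s1) - np.angle(s2)        # Equation (1) per sub-band, step 2)
    # Equation (2) evaluated on a grid of candidate directions (every 5 degrees)
    thetas = np.deg2rad(np.arange(-90, 91, 5))
    model = (2.0 * np.pi * freqs[None, :] / v) \
            * b * (thetas[:, None] + np.sin(thetas[:, None]))
    # Wrapped phase distance, then pick the closest theta per sub-band (step 3)
    diff = np.angle(np.exp(1j * (model - dphi[None, :])))
    idx = np.argmin(np.abs(diff), axis=0)
    return freqs, thetas[idx]
```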

2.2. Localization of the Sound Source Using the Scattering Theory

This method calculates a sound source direction θ_(s) in consideration of the waves scattered by the body 12 carrying the microphones 16. In this example, the body 12 is assumed to be the head of a robot, which forms a sphere of radius b. The center of the head is regarded as the origin of polar coordinates (r, θ, φ).

Details of the scattering theory can be found, for example, in “Scattering Theory” by Lax et al., Academic Press, NY, 1989.

The sound source localization using the scattering theory is performed according to the following procedure:

1) The FFT or the like is used to perform a frequency analysis on the sound signals input from the microphones 16a, 16b to determine spectra S1(f), S2(f).

2) The determined spectra are divided into multiple frequency sub-bands, and a phase difference Δφ(f_(i)) of each sub-band f_(i) is obtained in accordance with Equation (1). Alternatively, a sound intensity difference Δρ(f_(i)) of each sub-band f_(i) is obtained according to Equation (3).

$\begin{matrix}{{\Delta \; {\rho \left( f_{i} \right)}} = {20\mspace{11mu} \log_{10}\frac{{P\; 1\left( f_{i} \right)}}{{P\; 2\left( f_{i} \right)}}}} & (3)\end{matrix}$

where Δρ(f_(i)) indicates the sound intensity difference between the two microphones, P1(f_(i)) indicates the power in the sub-band f_(i) of microphone 1, and P2(f_(i)) indicates the power in the sub-band f_(i) of microphone 2.

3) Assuming that the position of the sound source 14 is r₀=(r₀, 0, 0), the position of the observation point (the microphone 16) is r=(b, 0, 0) and the distance between the sound source and the observation point is R=|r₀−r|, the potential V^(i) due to the direct sound at the head of the robot is defined as in Equation (4).

$$V^{i} = \frac{v}{2\pi R f}\, e^{\,j\frac{2\pi R f}{v}} \qquad (4)$$

where f indicates the frequency, v indicates the sound speed, and R indicates the distance between the sound source and the observation point.

4) The potential S(θ, f) due to the direct sound from the sound source direction θ and the sounds scattered at the head of the robot is defined as in Equation (5).

$\begin{matrix}{{S\left( {\theta,f} \right)} = {{V^{i} + V^{s}} = {{- \left( \frac{v}{2\pi \; {bf}} \right)^{2}}{\sum\limits_{n = 0}^{\infty}\; {\left( {{2n} + 1} \right){P_{n}\left( {\cos \; \theta} \right)}\frac{h_{n}^{(1)}\left( {\frac{2\pi \; r_{0}}{v}f} \right)}{h_{n}^{{(1)}\prime}\left( {\frac{2\pi \; b}{v}f} \right)}}}}}} & (5)\end{matrix}$

where V^(s) indicates the potential due to the scattered sounds, P_(n) indicates the Legendre function of the first kind, and h_(n)^((1)) indicates the spherical Hankel function of the first kind.

5) Assuming that the polar coordinate of the microphone 16a is represented by (b, π/2, 0) and the polar coordinate of the microphone 16b is represented by (b, −π/2, 0), the potential at each microphone is represented by Equation (6) and Equation (7).

S1(θ,f)=S(π/2−θ,f)  (6)

S2(θ,f)=S(−π/2−θ,f)  (7)

6) The phase difference Δφ(θ, f_(i)) and the sound intensity difference Δρ(θ, f_(i)) in each sub-band f_(i) are related to the direction θ of the sound source by Equation (8) and Equation (9), respectively.

Δφ(θ,f_(i))=arg(S1(θ,f_(i)))−arg(S2(θ,f_(i)))  (8)

$\begin{matrix}{{\Delta \; {\rho \left( {\theta,f_{i}} \right)}} = {20\mspace{11mu} \log_{10}\frac{{S\; 1\left( {\theta,f_{i}} \right)}}{{S\; 2\left( {\theta,f_{i}} \right)}}}} & (9)\end{matrix}$

7) Appropriate values of θ (for example, at every five degrees) are assigned in Equation (8) and Equation (9) in advance, so that the relation between the frequency f_(i) and the phase difference Δφ(θ, f_(i)), or the relation between the frequency f_(i) and the sound intensity difference Δρ(θ, f_(i)), is obtained.

8) Among the values of Δφ(θ, f_(i)) or Δρ(θ, f_(i)), the θ that is closest to Δφ(f_(i)) or Δρ(f_(i)) is taken as the sound source direction θ_(i) of each sub-band f_(i).

9) From the sound source direction θ_(i) and the frequency of each sub-band, the sub-bands whose source directions are close to each other and which are in a harmonic relation to each other are selected and grouped. The sound source direction of such a group is taken as θ_(s). When a plurality of groups are selected, multiple sound sources may exist; in this case, the sound source direction of each group may be obtained. When the number of sound sources is known in advance, it is desirable that the number of groups selected correspond to the number of sound sources. Besides, the sound source direction θ_(s) may be obtained by using both Δφ(f_(i)) and Δρ(f_(i)).
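
The series of Equation (5) can be sketched with scipy's spherical Bessel functions, from which the spherical Hankel function of the first kind and its derivative are assembled; the truncation order and the geometry values below are illustrative assumptions, not values prescribed by this embodiment.

```python
import numpy as np
from scipy.special import eval_legendre, spherical_jn, spherical_yn

def hankel1_sph(n, x, derivative=False):
    # h_n^{(1)}(x) = j_n(x) + i y_n(x); likewise for the derivative
    return spherical_jn(n, x, derivative) + 1j * spherical_yn(n, x, derivative)

def scatter_potential(theta, f, r0=1.0, b=0.09, v=340.0, n_terms=30):
    """Equation (5), truncated at n_terms (r0, b and v are assumed values)."""
    k = 2.0 * np.pi * f / v
    total = 0.0 + 0.0j
    for n in range(n_terms):
        total += ((2 * n + 1) * eval_legendre(n, np.cos(theta))
                  * hankel1_sph(n, k * r0)
                  / hankel1_sph(n, k * b, derivative=True))
    return -(v / (2.0 * np.pi * b * f)) ** 2 * total
```

Equations (6) through (9) then follow directly; for instance, Δφ(θ, f) of Equation (8) would be np.angle(scatter_potential(np.pi/2 - theta, f)) − np.angle(scatter_potential(-np.pi/2 - theta, f)).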

2.3. Sound Source Localization Using a Transfer Function

Measuring a transfer function is a general method for associating the phase difference and/or sound intensity difference with the frequency and the sound source direction. The transfer function is generated through measurement of impulse responses from various directions using the microphones 16a, 16b installed in the body 12 (which is, for example, a robot). This transfer function is used to identify the sound source direction. The sound source localization using the transfer function is performed according to the following procedure:

1) The FFT or the like is used to perform a frequency analysis on the sound signals input from the microphones 16a, 16b to determine spectra S1(f), S2(f).

2) The determined spectra are divided into multiple frequency sub-bands, and a phase difference Δφ(f_(i)) of each sub-band f_(i) is obtained in accordance with Equation (1). Alternatively, a sound intensity difference Δρ(f_(i)) of each sub-band f_(i) is obtained according to Equation (3).

3) Impulse responses are measured at an appropriate interval (for example, every five degrees) in a range of ±90 degrees to obtain a transfer function. Specifically, an impulse response for each direction θ is measured by the microphones 16a, 16b, and a frequency analysis using the FFT or the like is performed on the measured impulse response, so that spectra (transfer functions) Sp1(f), Sp2(f) for each frequency f corresponding to the impulse response are obtained. By using the following Equation (10) and Equation (11), a phase difference Δφ(θ, f) and a sound intensity difference Δρ(θ, f) are obtained from the transfer functions Sp1(f), Sp2(f).

Δφ(θ,f)=arg(Sp1(f))−arg(Sp2(f))  (10)

$\begin{matrix}{{{\Delta\rho}\left( {\theta,f} \right)} = {20\mspace{11mu} \log_{10}\frac{{{Sp}\; 1(f)}}{{{Sp}\; 2(f)}}}} & (11)\end{matrix}$

Calculations using Equation (10) and Equation (11) are performed for the direction θ at an arbitrary interval in the range of ±90 degrees and for arbitrary frequencies f. Examples of the calculated phase difference Δφ(θ, f) and sound intensity difference Δρ(θ, f) are shown in FIG. 5 and FIG. 6.

4) By using the relation shown in FIG. 5 or FIG. 6, the angle θ that is closest to Δφ(f_(i)) or Δρ(f_(i)) is determined. This θ is the sound source direction θ_(i) of each sub-band f_(i).

5) From the sound source direction θ_(i) and the frequency of each sub-band, the sub-bands whose source directions are close to each other and which are in a harmonic relation to each other are selected and grouped. The sound source direction of such a group is taken as θ_(s). When a plurality of groups are selected, multiple sound sources may exist; in this case, the sound source direction of each group may be determined. Besides, the sound source direction θ_(s) may be determined using both Δφ(f_(i)) and Δρ(f_(i)).
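
Steps 3) and 4) amount to building lookup tables from the measured impulse responses and then matching each observed sub-band value to the nearest tabulated direction. The sketch below assumes the impulse responses are available as a dict keyed by direction, and takes magnitudes in Equation (11) so the logarithm is well defined.

```python
import numpy as np

def build_tables(impulse_responses, n_fft=512):
    """impulse_responses: dict {theta_deg: (h1, h2)} of measured pairs."""
    thetas, dphi_tab, drho_tab = [], [], []
    for theta, (h1, h2) in sorted(impulse_responses.items()):
        sp1, sp2 = np.fft.rfft(h1, n_fft), np.fft.rfft(h2, n_fft)
        thetas.append(theta)
        dphi_tab.append(np.angle(sp1) - np.angle(sp2))               # Equation (10)
        drho_tab.append(20.0 * np.log10(np.abs(sp1) / np.abs(sp2)))  # Equation (11)
    return np.array(thetas), np.array(dphi_tab), np.array(drho_tab)

def match_direction(observed, table, thetas):
    """Step 4): per sub-band, the tabulated theta whose value is closest."""
    return thetas[np.argmin(np.abs(table - observed[None, :]), axis=0)]
```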

2.4. Sound Source Localization Using the Cross-Correlation of the Microphone Input Signals

This method determines the difference (d in FIG. 7) between the distances from the sound source 14 to the microphone 16a and to the microphone 16b based on the cross-correlation of the input signals of the microphones 16a and 16b, and estimates the sound source direction θ_(s) from the relation between the obtained distance d and the inter-microphone distance 2b. This method is performed according to the following procedure:

1) The cross-correlation CC(T) of the input signals of the microphone 16a and the microphone 16b is calculated by using Equation (12).

$\begin{matrix}{{{CC}(T)} = {\int_{0}^{T}{{x_{1}(t)}{x_{2}\left( {t + T} \right)}{t}}}} & (12)\end{matrix}$

where T indicates the frame length, x₁(t) indicates the input signal extracted over the frame length T from the microphone 16a, and x₂(t) indicates the input signal extracted over the frame length T from the microphone 16b.

2) Peaks are extracted from the calculated cross-correlation. It is desirable that the number of extracted peaks be equal to the number of sound sources when that number is known in advance. The positions of the extracted peaks on the time axis indicate the arrival time lag of the signals between the microphone 16a and the microphone 16b.

3) The difference (d in FIG. 7) between the distances from the sound source 14 to the microphones 16a and 16b is calculated based on the arrival time lag of the signals and the sound speed.

4) As shown in FIG. 7, the inter-microphone distance 2b and the difference d between the distances from the sound source to the microphones are used to calculate the direction θ_(s) of the sound source 14 from Equation (13).

θ_(s)=arcsin(d/(2b))  (13)

When a plurality of peaks are extracted, the sound source direction θ_(s) is obtained for each peak.
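
A minimal sketch of this procedure, assuming a single dominant source (one correlation peak) and glossing over the sign convention of the lag; b and v are assumed example values:

```python
import numpy as np

def tdoa_direction(x1, x2, fs, b=0.1, v=340.0):
    """b: half the inter-microphone distance [m]; v: sound speed [m/s]."""
    # Equation (12): cross-correlation over the frame; 'full' mode scans all lags
    cc = np.correlate(x1, x2, mode="full")
    lag = np.argmax(cc) - (len(x2) - 1)  # 2) peak position on the time axis
    d = (lag / fs) * v                   # 3) path-length difference d [m]
    d = np.clip(d, -2.0 * b, 2.0 * b)    # keep arcsin within its domain
    return np.arcsin(d / (2.0 * b))      # Equation (13)
```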

2.5. Tracing the Sound Source Direction

When the sound source 14 and/or the body 12 moves, the sound source direction is traced. FIG. 8 shows the change in time of the sound source direction θ_(s). The trace is performed as follows. The angle θ_(s) that is actually obtained is compared with the sound source direction θ_(p) predicted from the track of θ_(s) before that time point. When the difference is smaller than a predetermined threshold value, it is determined that the signals are from the same sound source. When the difference is larger than the threshold value, it is determined that the signals are from different sound sources. The prediction is performed by using a known prediction method for time series of signals, such as the Kalman filter, auto-regressive prediction, the HMM or the like.
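
As a toy illustration of this gating, the sketch below predicts θ_(p) by linear extrapolation of the recent track (a crude stand-in for the Kalman filter or auto-regressive prediction named above) and compares it against the new observation; the threshold is an assumed parameter.

```python
import numpy as np

def same_source(track, theta_obs, threshold=np.deg2rad(10.0)):
    """track: list of past theta_s values for one source."""
    if len(track) < 2:
        return True                                # too short to predict from
    theta_p = track[-1] + (track[-1] - track[-2])  # linear extrapolation
    return abs(theta_obs - theta_p) < threshold    # same source if within gate
```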

3. Sound Source Separation Unit

The sound source separation unit 23 uses the direction θ_(s) of the sound source 14 obtained in the sound source localization unit 21 to separate the sound source signals from the input signals. The separation in accordance with this embodiment is performed by combining the inter-microphone phase difference Δφ or the inter-microphone sound intensity difference Δρ, obtained using the above-described epipolar geometry, scattering theory or transfer function, with a pass-band function that imitates human auditory characteristics. However, any other known method that separates the sound source signals using the sound source direction and separates the sound source for each sub-band, such as a beam forming method or a GSS (Geometric Source Separation) method, may be used in the sound source separation unit 23. When the sound source separation is performed in the time domain, the signals are transformed into the frequency domain after the separation process. The sound source separation in this embodiment is performed according to the following procedure:

1) The sound source direction θ_(s) and the phase difference Δφ(f_(i)) or the sound intensity difference Δρ(f_(i)) of each sub-band f_(i) of the spectrum of the input signal are received from the sound source localization unit 21. When a technique that localizes the sound source in the frequency domain has not been used, Δφ(f_(i)) or Δρ(f_(i)) is obtained at this point using Equation (1) or Equation (3).

2) A pass-band function indicating the relation between a sound source direction and a pass-band is used to obtain the pass-band δ(θ_(s)) corresponding to the sound source direction θ_(s) obtained in the sound source localization unit 21.

The pass-band function is designed based on the human auditory characteristic that the resolution of the sound source direction is higher toward the front and lower toward the periphery. Therefore, for example, as shown in FIG. 9, the pass-band is set to be narrower in the front direction and wider in the periphery. The horizontal axis represents the sound source direction, with the front of the body 12 taken as 0 [deg].

3) From the obtained δ(θ_(s)), the lower limit θ_(l) and the upper limit θ_(h) of the pass-band (as illustrated in FIG. 10) are calculated by using Equation (14).

θ_(l)=θ_(s)−δ(θ_(s))

θ_(h)=θ_(s)+δ(θ_(s))  (14)

4) Phase differences Δφ_(l) and Δφ_(h) corresponding to θ_(l) and θ_(h), respectively, are estimated using one of the above-described epipolar geometry (Equation (2) and FIG. 4), scattering theory (Equation (8)) and transfer function (FIG. 5). FIG. 11 is a graph showing an example of the relation between the estimated phase difference and the frequency f_(i). Alternatively, the sound intensity differences Δρ_(l) and Δρ_(h) corresponding to θ_(l) and θ_(h) are estimated using one of the above-described scattering theory (Equation (9)) and transfer function (FIG. 6). FIG. 12 is a graph showing an example of the relation between the estimated sound intensity difference and the frequency f_(i).

5) It is checked whether Δφ(f_(i)) or Δρ(f_(i)) of each sub-band is located within the pass-band in order to select those that exist within the pass-band (FIG. 11 and FIG. 12). It is generally known that the precision of separation is higher when the phase difference is used for sound source localization at lower frequencies, and higher when the sound intensity difference is used at higher frequencies. Accordingly, for sub-bands lower than a predetermined threshold value (for example, 1500 [Hz]), the phase difference Δφ may be selected, and for sub-bands higher than the threshold value, the sound intensity difference Δρ may be selected.

6) The flags of the selected sub-bands are set to 1 and the flags of the unselected sub-bands are set to 0. The sub-bands having a flag of 1 are separated as the sound source signals.
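
The selection in steps 2) through 6) can be sketched as follows. Here delta() stands for the pass-band function of FIG. 9 and predict() for any of the models above (epipolar geometry, scattering theory or transfer function) evaluated at a direction for all sub-band frequencies; both are assumed interfaces rather than components defined by this embodiment. Per step 5), a caller would run this once with Δφ for sub-bands below about 1500 Hz and once with Δρ above.

```python
import numpy as np

def separate(freqs, observed, theta_s, delta, predict):
    """Return the 0/1 flag of step 6) for every sub-band."""
    theta_l = theta_s - delta(theta_s)  # Equation (14), lower limit
    theta_h = theta_s + delta(theta_s)  # Equation (14), upper limit
    lo, hi = predict(theta_l, freqs), predict(theta_h, freqs)
    band_lo, band_hi = np.minimum(lo, hi), np.maximum(lo, hi)
    # Flag the sub-bands whose observed difference falls inside the pass-band
    return ((observed >= band_lo) & (observed <= band_hi)).astype(int)
```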

Although the above-described sound source separation is performed with spectra in the linear frequency domain, spectra in the mel frequency domain may be used alternatively. The mel frequency is a sensory measure of pitch for a human being; its value roughly corresponds to the logarithm of the actual frequency. In this case, the sound source separation in the mel frequency domain is performed after step 1) of the above-described process in the sound source separation unit 23, according to the following procedure, in which a filtering process for converting the signals into the mel frequency domain is added.

1) Spectra S1(f), S2(f) are obtained by performing a frequency analysis on the signals input to the microphones 16a, 16b by using the FFT or the like.

2) A filter bank analysis is performed with triangular windows (for example, 24 of them) spaced evenly in the mel frequency domain.

3) A phase difference Δφ(m_(j)) of each sub-band m_(j) of the obtained mel frequency domain spectrum is obtained according to Equation (1) (with f_(i)→m_(j)). Alternatively, an inter-microphone sound intensity difference Δρ(m_(j)) is obtained according to Equation (3) (with f_(i)→m_(j)).

4) The pass-band function (FIG. 9) representing the relation between the sound source direction and the pass-band is used to obtain the pass-band δ(θ_(s)) corresponding to the sound source direction θ_(s) obtained in the sound source localization unit 21.

5) From the obtained δ(θ_(s)), the lower limit θ_(l) and the upper limit θ_(h) of the pass-band are calculated by using Equation (14).

6) Phase differences Δφ_(l), Δφ_(h) corresponding to θ_(l), θ_(h) are estimated by using one of the above-described epipolar geometry (Equation (2) and FIG. 4), scattering theory (Equation (8)) and transfer function (FIG. 5). Alternatively, sound intensity differences Δρ_(l), Δρ_(h) corresponding to θ_(l), θ_(h) are estimated by using one of the above-described scattering theory (Equation (9)) and transfer function (FIG. 6).

7) It is checked whether Δφ(m_(j)) or Δρ(m_(j)) of each mel frequency is located within the pass-band in order to select those that exist within the pass-band. It is generally known that the precision of separation is higher when the phase difference is used for localization at lower frequencies, and higher when the sound intensity difference is used at higher frequencies. Accordingly, for sub-bands lower than a predetermined threshold value (for example, 1500 [Hz]), the phase difference Δφ may be selected, and for sub-bands higher than the threshold value, the sound intensity difference Δρ may be selected.

8) The flags of the selected mel frequencies are set to 1 and the flags of the unselected mel frequencies are set to 0. The mel frequencies having a flag of 1 are regarded as the separated signals.

When the sound source separation is performed in the mel frequency domain, the conversion into the mel frequency domain in the mask generation unit 25 (to be described later) is not required.

4. Mask Generation Unit

The mask generation unit 25 generates a value of a mask according to the reliability of the separation result of the sound source separation unit 23. In this embodiment, any one of the following schemes may be used: a mask generation scheme using information from a plurality of sound source separation methods (section 4.1), a mask generation scheme using the pass-band function (section 4.2), and a mask generation scheme considering the influence of a plurality of sound sources (section 4.3). The mask generation unit 25 examines the reliability of the flag (0 or 1) set in the sound source separation unit 23 and establishes the value of the mask in consideration of the flag value and the reliability. The mask is assigned a value from 0 to 1; the closer the value is to 1, the higher the reliability.

4.1. Mask Generation Using Information from a Plurality of Sound Source Separation Methods

In this process, by using the results of signal separation by a plurality of sound source separation methods, the mask generation unit 25 confirms the reliability of the separation result of the sound source separation unit 23 so as to generate the mask. This process is performed according to the following procedure:

1) Sound source separation is performed using at least one sound source separation technique that is not used by the sound source separation unit 23, and a flag is established for each sub-band in the same manner as in the sound source separation unit 23. In this embodiment, the sound source separation by the sound source separation unit 23 is performed by using one of the following cues:

- i) phase difference based on the epipolar geometry
- ii) phase difference based on the scattering theory
- iii) sound intensity difference based on the scattering theory
- iv) phase difference based on the transfer function
- v) sound intensity difference based on the transfer function

2) The mask generation unit 25 examines whether the flags obtained in the sound source separation unit 23 correspond to the flags obtained in the above step 1) in order to generate the mask. For example, assuming that (i) the phase difference based on the epipolar geometry is used in the sound source separation unit 23, and that (ii) the phase difference based on the scattering theory, (iii) the sound intensity difference based on the scattering theory, and (v) the sound intensity difference based on the transfer function are used in the mask generation unit 25, the value of the mask in each situation is generated as follows:

TABLE 1

  Flag of (i)   Flags of (ii), (iii), (v)   Mask Value
  0             all 0s                      0
  0             two 0s                      1/3
  0             one or no 0s                1
  1             all 1s                      1
  1             two 1s                      1/3
  1             one or no 1s                0

3) A filter bank analysis on the mel scale is performed on the obtained mask values so as to convert them into values on the mel frequency axis, yielding the final mask values. It should be noted that when the sound source separation is performed in the mel frequency domain as described above, this step is not needed.

Besides, a mask value that has been converted to the mel frequency axis may be converted to a binary mask value that takes a value of 1 when the converted mask value exceeds a predetermined appropriate threshold value and a value of 0 when it does not.
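
Read literally, the Table 1 rule maps the number of auxiliary flags agreeing with the flag of method (i) to a mask value; a sketch of that mapping (before the mel-axis conversion of step 3)) might look like this:

```python
import numpy as np

def mask_from_agreement(flag_i, aux_flags):
    """flag_i: 0/1 array per sub-band from method (i);
    aux_flags: three 0/1 arrays from methods (ii), (iii) and (v)."""
    flag_i = np.asarray(flag_i)
    agree = sum((np.asarray(a) == flag_i).astype(int) for a in aux_flags)
    # Rows of Table 1 with flag 1: 3 agreements -> 1, 2 -> 1/3, fewer -> 0
    when_1 = np.select([agree == 3, agree == 2], [1.0, 1.0 / 3.0], default=0.0)
    # Rows of Table 1 with flag 0: 3 agreements -> 0, 2 -> 1/3, fewer -> 1
    when_0 = np.select([agree == 3, agree == 2], [0.0, 1.0 / 3.0], default=1.0)
    return np.where(flag_i == 1, when_1, when_0)
```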

4.2. Mask Generation Using the Pass-Band Function

In this method, the mask value is generated based on the closeness to the sound source direction by using the sound source direction θ_(s) and the pass-band function δ(θ_(s)). Specifically, the reliability of a flag to which the sound source separation unit 23 assigned a value of 1 is regarded as higher when the sub-band direction is closer to the sound source direction, whereas the reliability of a flag assigned a value of 0 is regarded as higher when the sub-band direction is further from the sound source direction. This process is performed according to the following procedure:

1) The sound source direction θ_(s) and the input signal are received from the sound source localization unit 21.

2) The sound source direction θ_(i) of each sub-band is obtained from the input signal (when the sound source direction has already been obtained in the sound source localization unit 21, that direction is used).

3) The pass-band δ(θ_(s)) (hereinafter represented by θ_(t)) and the flag of each sub-band f_(i) are received from the sound source separation unit 23.

4) A mask function is formed by using θ_(t), and a temporary mask is generated by comparison with θ_(i) of each sub-band. This function is given as in Equation (15) and its behavior is shown in FIG. 13.

$$\text{Temporary Mask} = \begin{cases} 1 & (-\pi \leq \theta_i < \theta_s - 2\theta_t) \\ -\dfrac{\theta_i - \theta_s}{\theta_t} - 1 & (\theta_s - 2\theta_t \leq \theta_i < \theta_s - \theta_t) \\ \dfrac{\theta_i - \theta_s}{\theta_t} + 1 & (\theta_s - \theta_t \leq \theta_i < \theta_s) \\ -\dfrac{\theta_i - \theta_s}{\theta_t} + 1 & (\theta_s \leq \theta_i < \theta_s + \theta_t) \\ \dfrac{\theta_i - \theta_s}{\theta_t} - 1 & (\theta_s + \theta_t \leq \theta_i < \theta_s + 2\theta_t) \\ 1 & (\theta_s + 2\theta_t \leq \theta_i < \pi) \end{cases} \qquad (15)$$

5) The mask is generated as shown in Table 2, based on the flag obtained in the sound source separation unit 23 and the temporary mask obtained in step 4) above.

TABLE 2

  Flag   Temporary Mask       Mask Value
  0      1                    0
  0      1 > Temp Mask > 0    Value of Temp Mask
  0      0                    1
  1      1                    1
  1      1 > Temp Mask > 0    Value of Temp Mask
  1      0                    0

6) A filter bank analysis on the mel scale is performed on the obtained mask values so as to convert them into values on the mel frequency axis, yielding the final mask values. It should be noted that when the sound source separation is performed in the mel frequency domain as described above, this step is not needed.

Besides, a mask value that has been converted to the mel frequency axis may be converted to a binary mask value that takes a value of 1 when the converted mask value exceeds a predetermined appropriate threshold value and a value of 0 when it does not.
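
Equation (15) collapses to a short closed form (the shape of FIG. 13: 1 at θ_(s) and far away, 0 at θ_(s) ± θ_(t)), and Table 2 can then be applied literally; both functions below are sketches of that reading, not a definitive implementation.

```python
import numpy as np

def temporary_mask(theta_i, theta_s, theta_t):
    """Equation (15) in closed form: with u = |theta_i - theta_s| / theta_t,
    the value is clip(|u - 1|, 0, 1)."""
    u = np.abs((theta_i - theta_s) / theta_t)
    return np.clip(np.abs(u - 1.0), 0.0, 1.0)

def mask_value(flag, temp):
    """Literal Table 2: with flag 0 the endpoints are flipped, while
    intermediate temporary-mask values are passed through unchanged."""
    flag, temp = np.asarray(flag), np.asarray(temp, dtype=float)
    flipped = np.select([temp == 1.0, temp == 0.0], [0.0, 1.0], default=temp)
    return np.where(flag == 1, temp, flipped)
```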

4.3. Mask Generation Considering the Influence of a Plurality of Sound Sources

In the case of a plurality of sound sources, the mask is generated so as to decrease the reliability of a sub-band when it is estimated that signals from at least two sound sources are included in that sub-band.

1) The sound source directions θ_(s1), θ_(s2), . . . and the input signal are received from the sound source localization unit 21.

2) The sound source direction θ_(i) of each sub-band is obtained from the input signal. When the sound source direction has already been obtained in the sound source localization unit 21, that direction is used.

3) The pass-bands (θ_(l1), θ_(h1)), (θ_(l2), θ_(h2)), . . . of the sound source directions θ_(s1), θ_(s2), . . . and the flags are received from the sound source separation unit 23.

4) It is examined:

- (i) whether the sound source direction θ_(i) of each sub-band is included in the pass-bands (θ_(l), θ_(h)) of two or more sound sources; or
- (ii) whether the sound source direction θ_(i) of each sub-band is not included even in the pass-band of its own sound source.

When either (i) or (ii) is true, a temporary mask having a value of 0 is generated for the sub-band, whereas a temporary mask having a value of 1 is generated for the sub-bands in all other cases.

5) A mask is generated as shown in Table 3, according to the flag and the temporary mask.

TABLE 3

  Flag   Temp Mask   Mask Value
  0      1           0
  0      0           1
  1      1           1
  1      0           0

6) A filter bank analysis on the mel scale is performed on the obtained mask values so as to convert them into values on the mel frequency axis, yielding the final mask values. It should be noted that when the sound source separation is performed in the mel frequency domain as described above, this step is not needed.

Besides, a mask value that has been converted to the mel frequency axis may be converted to a binary mask value that takes a value of 1 when the converted mask value exceeds a predetermined appropriate threshold value and a value of 0 when it does not.
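
A literal sketch of steps 1) through 5): the temporary mask is 1 only when a sub-band's direction falls inside exactly one source's pass-band, and Table 3 keeps the sub-bands whose flag and temporary mask agree.

```python
import numpy as np

def multi_source_mask(theta_i, passbands, flags):
    """theta_i: direction per sub-band; passbands: list of (theta_l, theta_h)
    pairs, one per source; flags: 0/1 flags from the separation unit."""
    theta_i = np.asarray(theta_i)
    hits = sum(((theta_i >= lo) & (theta_i <= hi)).astype(int)
               for lo, hi in passbands)
    temp = (hits == 1).astype(int)                  # step 4)
    return (np.asarray(flags) == temp).astype(int)  # Table 3
```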

5. Feature Extraction Unit

The feature extraction unit 27 determines the features from the spectrum of the input signal using a known technique. This process is performed according to the following procedure:

1) The spectrum is obtained by using the FFT or the like.

2) A filter bank analysis is performed with triangular windows (for example, 24 of them) spaced evenly in the mel frequency domain.

3) The logarithm of the analysis result is calculated to obtain a mel frequency logarithmic spectrum.

4) A discrete cosine transform is applied to the logarithmic spectrum.

5) The zero-order term and the higher-order terms (for example, the 13th to 23rd) of the cepstrum coefficients are set to zero.

6) Cepstrum mean suppression (CMS) is performed.

7) An inverse discrete cosine transform is performed.

The obtained features are represented by the feature vector x=(x₁, x₂, . . . , x_(j), . . . , x_(J)).
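
A condensed sketch of steps 1) through 7) for one frame is shown below. The triangular mel filter bank itself is elided (any standard 24-filter bank evenly spaced on the mel axis can be substituted), and the cepstral mean ceps_mean is assumed to be accumulated over preceding frames; the lifter range follows the example values in the text.

```python
import numpy as np
from scipy.fft import dct, idct

def extract_features(frame, mel_bank, ceps_mean, keep=13):
    spec = np.abs(np.fft.rfft(frame)) ** 2  # 1) power spectrum
    mel = mel_bank @ spec                   # 2) mel filter bank analysis
    log_mel = np.log(np.maximum(mel, 1e-10))  # 3) mel log spectrum
    ceps = dct(log_mel, norm="ortho")       # 4) discrete cosine transform
    ceps[0] = 0.0                           # 5) zero the 0th term ...
    ceps[keep:] = 0.0                       #    ... and the higher-order terms
    ceps = ceps - ceps_mean                 # 6) cepstrum mean suppression (CMS)
    return idct(ceps, norm="ortho")         # 7) inverse DCT -> features x
```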

6. Speech Recognition Unit

In this embodiment, the speech recognition unit 29 performs speech recognition by using the HMM, which is a known conventional technique.

When the feature vector is x and the state is S, the output probability f(x|S) of a usual continuous-density HMM is represented by Equation (16).

$\begin{matrix}{{f\left( x \middle| S \right)} = {\sum\limits_{k = 1}^{N}\; {{P\left( k \middle| S \right)}{f\left( {\left. x \middle| k \right.,S} \right)}}}} & (16)\end{matrix}$

where N represents the number of mixtures of the normal distribution and P(k|S) represents the mixture ratio.

The speech recognition based on the missing feature theory uses the result of averaging f(x|S) by the probability density function p(x) of x.

$\begin{matrix}{\overset{\_}{f\left( x \middle| S \right)} = {\sum\limits_{k = 1}^{N}\; {{P\left( k \middle| S \right)}{f\left( {\left. x_{r} \middle| k \right.,S} \right)}}}} & (17)\end{matrix}$

In Equation (17), x=(x_(r), x_(u)) is assumed, where x_(r) represents the reliable components of the feature vector (the value of their mask is larger than 0) and x_(u) represents the unreliable components of the feature vector (the value of their mask is 0).

Assuming that the unreliable components of the feature are distributed evenly in the range [0, x_(u)], Equation (17) can be rewritten as Equation (18).

$\begin{matrix}{\overset{\_}{f\left( x \middle| S \right)} = {\sum\limits_{k = 1}^{N}\; {{P\left( k \middle| S \right)}{f\left( {\left. x_{r} \middle| k \right.,S} \right)}\frac{1}{x_{u}}{\int_{0}^{x_{u}}{{f\left( {\left. x_{r}^{\prime} \middle| k \right.,S} \right)}{x_{u}^{\prime}}}}}}} & (18)\end{matrix}$

The output probability o(x_(j)|S) of the j-th component of x can be expressed as in Equation (19).

$\begin{matrix}{{o\left( x_{j} \middle| S \right)} = \left\{ \begin{matrix}{{{M(j)}{f\left( x_{j} \middle| S \right)}} + {\left( {1 - {M(j)}} \right)\overset{\_}{f\left( x_{j} \middle| S \right)}}} & {{{{ifM}(j)} \neq 0}\mspace{14mu}} \\1 & {otherwise}\end{matrix} \right.} & (19)\end{matrix}$

where M(j) represents the mask of the j-th component of the feature vector.

The overall output probability o(x|S) can be expressed as in Equation (20).

$\begin{matrix}{{o\left( x \middle| S \right)} = {\sum\limits_{k = 1}^{N}\; {{P\left( k \middle| S \right)}\exp \left\{ {\sum\limits_{j = 1}^{J}\; {{M(i)}\log \; {f\left( {\left. x_{i} \middle| k \right.,S} \right)}}} \right\}}}} & (20)\end{matrix}$

where J represents the dimension of the feature vector.

Equation (20) can also be expressed as in Equation (21).

$\begin{matrix}{{o\left( x \middle| S \right)} = {\sum\limits_{k = 1}^{N}\; {{P\left( k \middle| S \right)}\exp \left\{ {\sum\limits_{j = 1}^{J}\; {{M(i)}\log \; {f\left( {\left. x_{i} \middle| k \right.,S} \right)}}} \right\}}}} & (21)\end{matrix}$

The speech recognition is performed by using either Equation (20) or Equation (21).
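
As a hedged sketch of Equation (20) for a single state S, assume the mixture components f(x_(j)|k, S) are diagonal Gaussians (an assumption; the text only requires a continuous-density HMM); the mask M(j) then simply weights each dimension's log-likelihood.

```python
import numpy as np

def masked_output_prob(x, mask, weights, means, variances):
    """x, mask: shape (J,); weights: (N,); means, variances: (N, J)."""
    # log f(x_j | k, S) for every mixture component k and dimension j
    log_f = -0.5 * (np.log(2.0 * np.pi * variances)
                    + (x[None, :] - means) ** 2 / variances)
    # Equation (20): mask-weighted sum over dimensions, then the mixture sum
    per_mix = np.exp(np.sum(mask[None, :] * log_f, axis=1))
    return float(np.dot(weights, per_mix))
```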

Although the present invention has been described above with reference to specific embodiments, the present invention is not limited to such specific embodiments.

CLAIMS

1-4. (canceled)
5. A speech recognition apparatus for recognizing speech from sound received from outside, the apparatus comprising: at least two sound detectors for detecting sound; means for localizing a sound source based on the sound, said means for localizing determining a direction of the sound source; a first means for separating a speech based on the determined direction of the sound source; means for generating a mask according to reliability of a result of separation by said first means for separating a speech; means for extracting features of the sound; and means for recognizing speech from the sound by applying the mask to the extracted features.
6. The speech recognition apparatus as claimed in claim 5, wherein said means for generating a mask comprises: a second means for separating a speech according to the sound source from the sound based on the determined direction of the sound source, using a different source separating scheme than the one used in said first means for separating a speech; means for comparing the results of separation by said first means for separating a speech and said second means for separating a speech; and means for assigning masking values to sub-bands of the speech according to the comparison of the results.
7. The speech recognition apparatus as claimed in claim 6, wherein said first means for separating a speech comprises: means for identifying frequency sub-bands of the speech whose phase difference and/or sound intensity difference is within a pass-band.
8. The speech recognition apparatus as claimed in claim 5, wherein said means for generating a mask generates the value of the mask according to a pass-band that is identified according to the sound source direction and is used for determining whether or not the sounds are from the same sound source.
9. The speech recognition apparatus as claimed in claim 5, wherein, when there are multiple sound sources, said means for generating a mask assigns a higher value to a sub-band of the sound that is close to only one of the multiple sound sources.
10. A method for recognizing sound received by at least two sound detectors, comprising: localizing a sound source based on the sound, and determining a direction of the sound source; a first step of separating a speech based on the determined direction of the sound source; generating a mask according to reliability of a result of separation by said first step of separating a speech; extracting features of the sound; and recognizing speech from the sound by applying the mask to the extracted features.
11. The method as claimed in claim 10, wherein said generating a mask comprises: a second step of separating a speech according to the sound source from the sound based on the determined direction of the sound source, using a different source separating scheme than the one used in said first step of separating a speech; comparing the results of separation by said first step of separating a speech and said second step of separating a speech; and assigning masking values to sub-bands of the speech according to the comparison of the results.
12. The method as claimed in claim 11, wherein said first step of separating a speech comprises: identifying frequency sub-bands of the speech whose phase difference and/or sound intensity difference is within a pass-band.